Abstract — In this paper, we consider the power allocation of the physical layer and the buffer delay of the upper application layer 
in energy harvesting green networks. We analyze the delay-optimal power allocation problem over fading channels. The total power 
required for reliable transmission includes the transmission power and the circuit power. The harvested power (which is stored in a 
battery) and the grid power constitute the power resource. The objective is to find a policy that minimizes the buffer delay under a 
constraint on the average grid power. The policy is a two-dimensional vector whose elements are the transmission rate and the power 
allocated from the battery. In each transmission, the transmitter decides the transmission rate as well as the power allocated from the 
battery, and the rest of the required power is supplied by the power grid. A constrained Markov decision process (MDP) problem is 
formulated when the data arrival process, the harvested energy arrival process, and the channel process are Markov processes. Two 
cases are considered. First, when the battery capacity is infinite, we obtain the optimal rate through a reduced MDP problem that 
depends only on the average harvested energy and not on the harvested energy arrival process; the battery's power allocation is then 
given based on the optimal rate. Second, when the battery capacity is finite, we derive some structural properties of the optimal 
two-dimensional solution through transformations to the average cost MDP and the discount cost MDP. Two necessary conditions for the 
optimal policy are obtained. Moreover, we derive two policies that are optimal under certain conditions. Finally, we discuss the 
dimension reduction of the policy under finite capacity. The simulation results show the interactions of the initial system state, the 
channel, the data buffer length, the data arrival, the harvested energy arrival, and the power grid under different policies. 

Index Terms — Green communications, energy harvesting, cross-layer design, power allocation, Markov decision process. 
1 Introduction 

The rapid development of the wireless communication industry 
has led to a dramatic increase in the energy consumption of 
wireless networks, and this increasing energy consumption 
produces a series of energy and environmental problems. 
Recently, green communications, which aim at enhancing 
energy efficiency and reducing carbon emissions, have received 
considerable attention [1]-[5]. In the energy-efficient design of 
wireless communications, the total energy consumption 
includes not only the transmission energy but also the 
circuit energy consumption [6]. 

As a preferred choice for supporting green communications, 
energy harvesting techniques such as photovoltaic solar cells 
have become popular for their ability to prolong the lifetime 
of the battery and thereby the lifetime of wireless networks. 
There has been a lot of research on wireless networks with 
energy harvesting nodes. In [7], an optimal energy management 
policy for a solar-powered sensor node was proposed. The policy 
uses a sleep and wakeup strategy for energy conservation. In 
[8], throughput optimal and mean delay optimal energy 
management policies were studied for a single energy 
harvesting sensor node. The Shannon capacity of an energy 
harvesting sensor node transmitting over an AWGN channel was 
obtained in [9]. In [10], the optimal binary transmission 
policies were studied under i.i.d. Bernoulli energy arrivals. 
In [11], the long-term average communication reliability 
optimization problem was studied for the system of 
energy-harvesting active networked tags (EnHANTs). In [12] and 
[13], throughput-maximal schemes of energy allocation for 
wireless communications with energy harvesting constraints 
were studied. 

Resource allocation is a fundamental problem in wireless 
communications [14]. Generally, resource consumption reduction 
and quality of service (QoS) improvement are two conflicting 
objectives in a resource allocation problem. There has been 
some interest in analyzing the power allocation and delay 
performance from the cross-layer perspective. In [15] and [16], 
the tradeoff between the average power required for reliable 
transmission at the physical layer and the mean delay at the 
network layer was studied in fading channels. The adaptive 
control policies utilize information on both the queue state 
and the channel state, and some structural results for the 
optimal policy were derived. In [17] and [18], the authors 
improved upon the results obtained in [16]. They considered the 
optimization problem of minimizing the delay in the transmitter 
buffer under an average transmitter power constraint. The 
existence of a stationary average optimal policy was proved and 
some structural results were obtained. In [19], the fading 
channel was simplified to a static channel, and the explicit 
optimal control policy was characterized. 

In [16]-[19], only the transmission power is considered. 
However, as shown in (TJ, the transmission strategy 
changes when taking the circuit power into account. 
A natural question is then how the power and delay behave 
when both the transmission power and the circuit power are 
considered. Meanwhile, as the energy allocation of the 
battery plays a central role in the transmission strategy 
of energy harvesting nodes, how will the energy allocation 
strategy of the battery affect the power and delay? 

In this paper, we consider the power allocation in 
the physical layer and the delay performance in the 
upper application layer in green wireless networks with 
energy harvesting nodes. The data are generated in the 
application layer and placed in a buffer at the transmitter. 
The transmitter periodically removes some data from the 
buffer and transmits the data to the receiver. The power 
required for reliable transmission takes both the 
transmission power and the circuit power into account, and 
the power resource consists of the harvested power and the 
grid power. The harvested energy arrives randomly, and there 
is a constraint on the average grid power. The objective is 
to minimize the average delay in the buffer with a 
constrained average grid power and random battery energy. 
Since the power required for each transmission can be 
supplied from both the battery and the grid, the policy in 
the formulated optimization problem is two-dimensional, 
i.e., it consists of the rate as well as the allocation of 
the battery energy (the grid power allocation is then the 
total required power minus the allocated battery power). 

Specifically, the main contributions of the paper can 
be summarized as follows. 

• We consider the delay-optimal power allocation in 
the framework of green communications, where the 
power comes from both the power grid and harvesting 
devices. When the data arrival process and the 
harvested energy arrival process are Markov 
processes and the channel process is a Markov 
chain, we formulate the problem as a constrained 
Markov decision process (MDP) problem, in which 
the state and action are defined. The state includes 
the queue state (i.e., the queue length in the buffer), 
the battery state (i.e., the stored energy in the 
battery), the channel state, the data arrival, and the 
harvested energy arrival. The action consists of the 
transmission rate and the power allocation from the 
battery. 

• For the special case that the battery's capacity is 
infinite, we prove that the optimal policy can be 
obtained as follows. We first choose the optimal rate 
by solving a reduced constrained MDP problem, 
which is only related to the average harvested en- 
ergy but not the harvested energy arrival process. 
Then, according to the optimal rate, the allocation 
policy of the harvested energy is given. To solve the 
reduced constrained MDP problem, it is converted 
to be an average cost MDP, and some structural 
properties of the optimal solution are derived. 

• When the battery's capacity is finite, we consider 
the policy as a two-dimensional vector whose elements 
are the transmission rate and the power allocation 
from the battery. Using the Lagrangian methodology, 
the constrained MDP can be relaxed to an unconstrained 
problem (UP), which is an average cost MDP. We prove 
that the optimal solution of the UP with a certain 
Lagrangian multiplier is the optimal solution of the 
original constrained MDP. Meanwhile, the average cost 
MDP (i.e., the UP) can be analyzed by converting it to 
the discount cost MDP. We verify the existence of a 
stationary policy and derive two necessary conditions 
for the optimal policy. Under certain conditions, the 
policy of serving nothing and allocating no energy 
from the battery is optimal. We also prove that serving 
everything, combined with allocating the minimum of 
the total required power and the total energy in the 
battery, is optimal under certain conditions. 

• Finally, we analyze the relation between the 
transmission rate and the power allocation from the 
battery. We find that the transmission rate is 
dominant, and we propose a conjecture that the original 
problem can be reduced to an MDP problem in which the 
policy is the transmission rate only. 

The remainder of the paper is organized as follows. In 
Section 2, the system model is described, and we formu- 
late the MDP problem. Next, the formulated problem 
is analyzed under infinite capacity and finite capacity 
circumstances in Section 3 and Section 4, respectively. 
Simulations are performed in Section 5. Finally, Section 
6 concludes the paper. 

2 System model and problem formulation 

We consider a slotted-time model of a point-to-point block fading channel. The length of a time-slot is $\tau$ units. The $n$-th time-slot is the time interval $[n\tau, (n+1)\tau)$. The channel gain remains static in each slot and changes between slots. The sequence of channel gains is a finite-state ergodic Markov chain $\{H[n]\}$. As shown in Fig. 1, at the end of the $n$-th slot, the higher layer generates $A[n]$ packets, which are stored in a buffer before transmission. It is assumed that each packet has $b$ bits and that $\{A[n]\}$ is a finite-state ergodic Markov chain. We assume that the transmitter is equipped with an energy harvesting device and can also draw power from the power grid. The harvested energy arrives at the end of each slot according to a finite-state ergodic Markov chain $\{E[n]\}$, and it is stored in a battery before consumption. There is a long-run average constraint on the grid power at the transmitter. During the $n$-th time-slot, the transmitter chooses $R[n]$ packets from the buffer and transmits them to the receiver. We assume that the additive white Gaussian noise (AWGN) at the receiver has zero mean and variance $\sigma^2$. In green communications, the total power required for reliable transmission$^1$ of $r$ packets in a time-slot is

$$P(x, r) = \rho\,\frac{\sigma^2}{h}\left(e^{\theta r} - 1\right) + \Delta(r), \tag{1}$$

where $x$ is the system state that will be defined later, $\rho \geq 1$ is a constant, $\theta = \frac{2\ln(2)\,b}{N}$ with $N$ being the number of channel uses in each time-slot, and

$$\Delta(r) = \begin{cases} C, & r > 0, \\ 0, & r = 0, \end{cases}$$

where $C \geq 0$ is a constant. In particular, $\rho = 1$ and $C = 0$ when no circuit power is taken into account. In the transmission during the $n$-th time-slot, the transmitter allocates power $W[n]$ from the battery, and the rest of the required power is supplied by the power grid. Denote by $Q[n]$ the queue length of the buffer at instant $n\tau$; the evolution equation for the buffer length is

$$Q[n+1] = Q[n] - R[n] + A[n]. \tag{2}$$

Assume that the capacity of the battery is $E_{\max}$.$^2$ Denote the battery's stored energy at instant $n\tau$ by $E_b[n]$; then the evolution equation for the harvested energy in the battery is given by

$$E_b[n+1] = \min\{E_b[n] - W[n]\tau + E[n],\, E_{\max}\} := \left(E_b[n] - W[n]\tau + E[n]\right)^{-}. \tag{3}$$

Fig. 1. System model. (The figure shows, for the $n$-th time-slot $[n\tau, (n+1)\tau)$, the data arrival $a[n]$ from the higher layer, the harvested energy arrival $e[n]$, the data buffer with queue length $q[n]$, the transmission rate $r[n]$, the channel $h[n]$, the battery with stored energy $e_b[n]$ and allocated power $w[n]$, and the power $p_{grid}[n]$ from the power grid.)

In this paper, we denote the system state as $X[n] = (Q[n], H[n], A[n], E_b[n], E[n])$ with state space $\mathcal{X}$ and denote the action as $(R[n], W[n])$ with action space $\mathcal{A}$. $\{X[n], (R[n], W[n])\}$ is a controlled Markov process.

Define a policy $\pi = (\pi_0, \pi_1, \cdots)$ that generates an action $(r[n], w[n])$ with a probability [20], [21] at instant $n\tau$. We denote the set of all policies as $\Pi$. Let $x[n] = (q[n], h[n], a[n], e_b[n], e[n])$. The feasible $(r[n], w[n])$ in state $x[n]$ belongs to $\mathcal{R}(x[n]) \times \mathcal{W}(x[n])$, where $\mathcal{R}(x[n]) = \{0, 1, \cdots, q[n]\}$ and $\mathcal{W}(x[n]) = \{0, \frac{1}{\tau}, \cdots, \frac{e_b[n]}{\tau}\}$.$^3$ A stationary deterministic policy is $\pi = (g, g, \cdots)$, where $g$ is a measurable mapping from $\mathcal{X}$ to $\mathcal{R}(x[n]) \times \mathcal{W}(x[n])$. Our objective is to find a policy that minimizes the mean buffer delay under the long-run constraint on the grid power, $\bar{P}$. The optimization problem (i.e., the constrained MDP) is given by

$$\min_{\pi \in \Pi} B_x^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} Q[k]\right] \tag{4}$$

$$\text{s.t.} \quad K_x^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} P_{grid}[k]\right] \leq \bar{P}, \tag{5a}$$

$$R[k] \leq Q[k], \tag{5b}$$

$$W[k]\tau \leq E_b[k], \tag{5c}$$

where $P_{grid}[k]$ is the power from the power grid,

$$P(X[k], R[k]) = P_{grid}[k] + W[k], \tag{6}$$

and the subscript $x = (q, h, a, e_b, e) \in \mathcal{X}$ denotes the initial system state.

1. Totally error-free according to capacity arguments.

2. It is assumed that $E[n] < E_{\max}$ for $n = 0, 1, \cdots$.
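The slot-by-slot dynamics in (1)-(3) and the power split in (6) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `required_power` and `step` are hypothetical, the default parameter values are borrowed from the simulation settings of Section 5 ($\tau = 1$, $b = 1$, $N = 3$, $C = 1$, $\rho = 1$, $\sigma^2 = 1$), and the battery capacity of 200 is the finite capacity used there.

```python
import math

def required_power(h, r, *, rho=1.0, sigma2=1.0, C=1.0, b=1, N=3):
    """Total power P(x, r) of Eq. (1) for sending r packets at channel gain h."""
    if r == 0:
        return 0.0  # nothing is sent: no transmission power and Delta(0) = 0
    theta = 2.0 * math.log(2.0) * b / N
    return rho * sigma2 / h * (math.exp(theta * r) - 1.0) + C

def step(q, e_b, h, a, e, r, w, *, tau=1.0, e_max=200.0, **power_kwargs):
    """Advance one slot and return (next queue, next battery energy, grid power)."""
    if not 0 <= r <= q:
        raise ValueError("rate must satisfy 0 <= r <= q, see (5b)")
    if not 0 <= w * tau <= e_b:
        raise ValueError("battery draw must satisfy 0 <= w*tau <= e_b, see (5c)")
    p_total = required_power(h, r, **power_kwargs)
    p_grid = max(p_total - w, 0.0)          # Eq. (6): the grid covers the shortfall
    q_next = q - r + a                      # Eq. (2)
    e_next = min(e_b - w * tau + e, e_max)  # Eq. (3)
    return q_next, e_next, p_grid
```

For instance, `step(5, 10.0, 1.0, 2, 3.0, 3, 1.5)` sends three of five queued packets, draws 1.5 units of power from the battery, and leaves the remainder of the required power to the grid.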



3 Special case: infinite battery capacity 

In this section, the optimal policy under infinite battery 
capacity is studied. We first investigate how to derive 
the optimal policy. We prove that the optimal rate and 
optimal battery power allocation can be obtained succes- 
sively in Section 3.1. Then, we analyze the properties of 
the optimal rate in Section 3.2. 



3.1 The optimal policy 

When the battery capacity can be considered infinite, we 
have the following optimal policy. 

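Theorem 1. When the battery capacity is infinite, i.e., $E_{\max} = \infty$, the optimal policy is $(r^*, w^*)$, where $r^* = (r^*[0], r^*[1], \cdots)$ is the optimal solution for

$$\min_{\pi} J_{x'}^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{n-1} Q[k]\right] \tag{7}$$

$$\text{s.t.} \quad L_{x'}^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{n-1} P(X'[k], R[k])\right] \leq \bar{P} + \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x''}\left[\sum_{k=0}^{n-1} \frac{E[k]}{\tau}\right], \tag{8a}$$

$$R[k] \leq Q[k], \tag{8b}$$

where $\pi = (\pi_0, \pi_1, \cdots)$ generates $r[n]$ at instant $n$, $x' = (q, h, a)$, $x'' = (e_b, e)$, and $X'[k] = (Q[k], H[k], A[k])$. And $w^* = (w^*[0], w^*[1], \cdots)$ is given by

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} w^*[k] = \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x''}\left[\sum_{k=0}^{n-1} \frac{E[k]}{\tau}\right]$$

with

$$w^*[k] \in \left[0,\; \min\left\{P(x[k], r^*[k]),\; \frac{e_b[k]}{\tau}\right\}\right]. \tag{9}$$

Proof: See Appendix A. □

3. The harvested energy has been discretized.

4. $\bar{P} \geq 0$.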
Remark: In (5), $K_x^{\pi} \leq \bar{P}$ is the constraint on the average power 
from the grid, and $W[k]\tau \leq E_b[k]$ is the constraint on the per-step 
energy (or power) from the battery. The total power can 
be obtained from both the grid and the battery in each step (i.e., 
(6)). If the battery's capacity is infinite, the constraint on 
$P(X[k], R[k])$ can be averaged (see $L_{x'}^{\pi}$ in (8a)) while solving 
for the optimal transmission rate. 

In the following, by analyzing the solution of the 
reduced MDP (i.e., (7)), we derive some properties of 
the optimal rate. 

3.2 Properties of the optimal rate 

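Let

$$P_h = \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x''}\left[\sum_{k=0}^{n-1} \frac{E[k]}{\tau}\right],$$

which is the average harvested power. In addition, denote $\tilde{P} = \bar{P} + P_h$. Then (7) can be re-expressed as

$$\min_{\pi} J_{x'}^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{n-1} Q[k]\right] \tag{10}$$

$$\text{s.t.} \quad L_{x'}^{\pi} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{n-1} P(X'[k], R[k])\right] \leq \tilde{P}, \tag{11a}$$

$$R[k] \leq Q[k]. \tag{11b}$$

Let $f_{\kappa}(x', r) = q + \kappa P(x', r)$. Define the unconstrained problem corresponding to (10) as

$$\min_{\pi} J_{\kappa}^{\pi}(x') := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{n-1} f_{\kappa}(X'[k], R[k])\right]. \tag{12}$$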



Then, we can prove the following lemma that reveals 
the relation between the constrained and unconstrained 
problem. 



Lemma 1. There exists a $\kappa > 0$ for which the optimal 
policy obtained for (12) is also optimal for (10). 

Due to the space limitation, the proof is omitted here, 
and so are all proofs thereafter in this section. All omitted 
proofs can be found in [22]. 

Based on Lemma 1, solving (10) can be transformed into 
solving the unconstrained problem (12). In the 
following, we focus on the solution of the unconstrained 
problem. 

As the unconstrained problem (12) is an average cost 
MDP, it can be studied by analyzing its corresponding 
discount cost MDP [24]. Define a discount cost MDP 
with discount factor $\varepsilon \in (0, 1)$ corresponding to (12) for 
initial state $x' = (q, h, a)$ with value function 

$$V_{\varepsilon}(q, h, a) = \min_{\pi} \mathbb{E}_{x'}^{\pi}\left[\sum_{k=0}^{\infty} \varepsilon^{k}\left(Q[k] + \kappa P(X'[k], R[k])\right)\right]. \tag{13}$$
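As an illustration of how a discounted problem of the form (13) can be solved numerically, the following sketch runs value iteration on a small instance. It relies on simplifying assumptions that are not part of the paper: $H$ and $A$ are i.i.d., the arrival is revealed only after the rate decision (so the state reduces to $(q, h)$), and the buffer is truncated at a hypothetical `q_max`. The distributions, the value of `kappa`, and all function names are placeholders chosen for the example.

```python
import math

def power(h, r, rho=1.0, sigma2=1.0, C=1.0, theta=2.0 * math.log(2.0) / 3.0):
    """Required power from Eq. (1); zero when nothing is sent."""
    return 0.0 if r == 0 else rho * sigma2 / h * (math.exp(theta * r) - 1.0) + C

def discounted_value_iteration(kappa, eps=0.9, q_max=12, sweeps=150,
                               H=((0.4, 0.3), (0.6, 0.5), (1.0, 0.2)),
                               A=((0, 0.5), (2, 0.5))):
    """Return V[q][i], an approximation of V_eps(q, h_i), and the backlog u[q][i]."""
    V = [[0.0] * len(H) for _ in range(q_max + 1)]
    U = [[0] * len(H) for _ in range(q_max + 1)]
    for _ in range(sweeps):
        # Expected value of the next state as a function of the next queue length.
        W = [sum(p * V[q][i] for i, (_, p) in enumerate(H)) for q in range(q_max + 1)]
        # Expected continuation value when u packets are left in the buffer.
        cont = [sum(pa * W[min(u + a, q_max)] for a, pa in A) for u in range(q_max + 1)]
        V_new = [[0.0] * len(H) for _ in range(q_max + 1)]
        for q in range(q_max + 1):
            for i, (h, _) in enumerate(H):
                # Per-stage cost q + kappa * P plus discounted continuation, for each u = q - r.
                costs = [q + kappa * power(h, q - u) + eps * cont[u] for u in range(q + 1)]
                U[q][i] = min(range(q + 1), key=costs.__getitem__)
                V_new[q][i] = costs[U[q][i]]
        V = V_new
    return V, U

V, U = discounted_value_iteration(kappa=0.5)
```

In this toy setting, `U[q][i]` is the number of packets left in the buffer, so the number of transmitted packets is `q - U[q][i]`. Sweeping `kappa` trades queue length against average power, which is the role of the multiplier in Lemma 1.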



Before analyzing the unconstrained problem, we have 
the following results. 

Property 1. $V_{\varepsilon}(q, h, a)$ is increasing and convex in $q$. 

First of all, the following lemma reveals the existence 
of an optimal stationary deterministic policy for the 
unconstrained problem. Furthermore, it derives the relation 
between the optimal solution of the unconstrained problem 
and the optimal policy for the discount MDP. 

Lemma 2. There exists a stationary deterministic policy 
$g_1$ that solves (12) (i.e., an optimal policy), and for any 
$x' = (q, h, a)$ and given any sequence of discount factors 
converging to one, there exists a subsequence $\{\varepsilon_m\}$ of 
discount factors and a sequence $x'_m \to x'$ such that 
$g_1(x') = \lim_{m \to \infty} g_{\varepsilon_m}(x'_m)$. 

For each state-action pair $(x' = (q, h, a), r)$, let $u = q - r$, 
$u \in \{0, 1, \cdots, q\}$. Then, $u(x')$ also defines the stationary 
deterministic policy. Let $D_{\varepsilon}(q, h, a) = V_{\varepsilon}(q, h, a) - V_{\varepsilon}(q - 1, h, a)$ and 

$$T(u, h, a) = e^{\theta u}\left\{\lim_{\varepsilon \to 1} \varepsilon\, \mathbb{E}_{h,a}\left[D_{\varepsilon}(u + A, H, A)\right] - \kappa\left[\Delta(q - u + 1) - \Delta(q - u)\right]\right\}. \tag{14}$$

In the following, we give the structural properties 
(Lemma 3 - Lemma 6) of the optimal policy for the 
unconstrained problem. 

Lemma 3. Let $u^*(q, h, a)$ be the solution of 

$$T(u, h, a) < \frac{\kappa \rho \sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right) < T(u + 1, h, a). \tag{15}$$

The solution of (12) is $u(q, h, a) = \min\{u^*(q, h, a), q\}$. 
Lemma 4. $u^*(q, h, a)$ is non-decreasing in $\kappa$. 
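Lemma 5. Let $\psi = \mathbb{E}_{h,a}[\cdots]$ and $\nu = \mathbb{E}_{h}[\cdots]$. $u^*(q, h, a)$ 
can be lower bounded by 

$$\cdots \tag{16}$$

where $b_1 = \kappa \rho \sigma^2 \psi$, $b_2 = 1 + \kappa(C - \rho \sigma^2 \nu)$, $b_3 = \frac{\kappa \rho \sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right)$. And it is upper bounded by$^5$ 

$$\frac{1}{\theta} \ln\left(\frac{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)}{h(1 - \kappa C)}\right). \tag{17}$$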



In addition, the optimal number of transmitted packets 
r(q, h,a) = q — u(q, h, a) is non-decreasing in q. 

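Lemma 6. When $A[n]$ and $H[n]$ are independent and 
identically distributed (i.i.d.), $T(u, h, a)$ is independent 
of $(h, a)$. Define $U(y)$ as the value of $u$ that solves 
$T(u) < e^{\theta y} < T(u + 1)$; the optimal policy is then 

$$u^*(q, h) = \min\left\{q,\; U\left(q - \frac{1}{\theta} \ln \frac{h}{\kappa \rho \sigma^2 \left(e^{\theta} - 1\right)}\right)\right\}.$$

In conclusion, if 

$$\kappa C < \lim_{\varepsilon \to 1} \varepsilon\, \mathbb{E}_{h,a}\left[D_{\varepsilon}(q + A, H, A) - e^{-\theta} D_{\varepsilon}(q - 1 + A, H, A)\right], \tag{18}$$

the optimal solution is 

$$u(q, h) = \begin{cases} 0, & h \geq \dfrac{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)}{T(0)}, \\[2ex] q, & h \leq \dfrac{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)}{T(q)}, \\[2ex] u^*(q, h), & \text{otherwise}. \end{cases}$$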

Remark: It can be derived that 

$$\frac{T(q)}{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)} = \frac{\lim_{\varepsilon \to 1} \varepsilon\, \mathbb{E}_{h,a}\left[D_{\varepsilon}(q + A, H, A)\right] - \kappa C}{\kappa \rho \sigma^2 \left(e^{\theta} - 1\right)}$$

and 

$$\frac{T(0)}{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)} = \frac{\lim_{\varepsilon \to 1} \varepsilon\, \mathbb{E}_{h,a}\left[D_{\varepsilon}(A, H, A)\right]}{\kappa \rho \sigma^2 e^{\theta q}\left(e^{\theta} - 1\right)}.$$

Then, we can see the impact of $\rho$ and $C$ on the optimal 
policy. When $\rho > 1$ and $C > 0$ (i.e., extra circuit power 
is considered), a better channel condition is needed to serve all 
data, and the upper bound of the channel condition to serve 
nothing becomes larger. In other words, it is easier to serve 
nothing and harder to serve all. This can also be seen as follows. 
From (12), compared with $\rho = 1$ and $C = 0$ (i.e., no 
circuit power is considered), $\rho > 1$ together with $C > 0$ 
is equivalent to increasing $\kappa$. According to Lemma 4, $u^*$ will 
be non-decreasing. In other words, less data will be served. 

4 General case: general battery capacity 

In this section, we investigate the optimal policy under 
finite battery capacity. The original constrained MDP can 
be analyzed by converting it to an average cost MDP, and 
the average cost MDP can be studied through its 
corresponding discount cost MDP. We first prove the existence 
of the stationary policy. Next, the optimal policy for the 
discount cost MDP is analyzed in Section 4.1. Based on 
the results of Section 4.1, we investigate the optimal 
policy of the average cost MDP in Section 4.2. In Section 
4.3, we derive the optimal policy for the original MDP 
by proving the relation between the original MDP and 
the average cost MDP. Finally, we discuss the dimension 
reduction of the two-dimensional policy in Section 4.4. 

5. $\kappa < \frac{1}{C}$. 

Define $P_{grid}(x, r, w) = \max\{P(x, r) - w, 0\} := (P(x, r) - w)^{+}$ 
and $f_{\beta}(x, r, w) = q + \beta P_{grid}(x, r, w)$ with $\beta > 0$. The 
original constrained MDP can be converted to a family 
of the following unconstrained problems (UP$_{\beta}$). 

$$\min_{\pi} J_{\beta}^{\pi}(x) := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{x}^{\pi}\left[\sum_{k=0}^{n-1} f_{\beta}(X[k], R[k], W[k])\right]. \tag{19}$$



Remark: UP$_{\beta}$ is an average cost MDP. Its optimal solution is 
called the average cost optimal policy. 

Define a discount cost MDP with discount factor $\alpha$ 
corresponding to UP$_{\beta}$ for each initial state $x = (q, h, a, e_b, e)$, 
with value function 

$$V_{\alpha}(q, h, a, e_b, e) = \min_{\pi} \mathbb{E}_{x}^{\pi}\left[\sum_{k=0}^{\infty} \alpha^{k}\left(Q[k] + \beta P_{grid}(X[k], R[k], W[k])\right)\right]. \tag{20}$$



The optimal solution for the discounted problem is 
referred to as a discount optimal policy. 

The following lemma reveals the existence of the 
stationary policy. 

Lemma 7. There exists a stationary deterministic policy 
$\pi = (r, w)$ that solves UP$_{\beta}$, and it can be obtained as a 
limit of discount optimal policies as the discount factor 
increases to one. 

Proof: See Appendix B. □ 
According to Lemma 7, UP$_{\beta}$ can be solved through its 
corresponding discount cost MDP. We therefore first analyze 
the discount optimal policy. 



4.1 Optimal policy for the discount cost MDP 

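In this subsection, we investigate the optimal policy 
for the discount cost MDP. For the state-action pair 
$(x = (q, h, a, e_b, e), (r, w))$, let $u = q - r$ and $\eta = e_b - w\tau$; 
$(u(x), \eta(x))$ can also define a stationary policy. Then, the 
discounted cost optimality equation becomes 

$$V_{\alpha}(q, h, a, e_b, e) = \min_{u \in \{0, 1, \cdots, q\},\, \eta \in \{0, 1, \cdots, e_b\}} \left\{ q + \beta\left[\rho \frac{\sigma^2}{h}\left(e^{\theta(q - u)} - 1\right) + \Delta(q - u) - \frac{e_b - \eta}{\tau}\right]^{+} + \alpha\, \mathbb{E}_{h,a,e}\left[V_{\alpha}(u + A, H, A, (\eta + E)^{-}, E)\right] \right\}, \tag{21}$$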
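A small numerical sketch may help to see how an optimality equation of the form (21) is evaluated over the two-dimensional action $(u, \eta)$. The code below is an illustration under assumptions that are not in the paper: $H$, $A$, and $E$ are i.i.d. and revealed after the decision (so the state reduces to $(q, h, e_b)$), the buffer is truncated at a hypothetical `q_max`, and all distributions, sizes, and function names are placeholders.

```python
import math
from itertools import product

THETA = 2.0 * math.log(2.0) / 3.0  # theta = 2 ln(2) b / N with b = 1, N = 3

def power(h, r, rho=1.0, sigma2=1.0, C=1.0):
    """Required power from Eq. (1); zero when nothing is sent."""
    return 0.0 if r == 0 else rho * sigma2 / h * (math.exp(THETA * r) - 1.0) + C

def value_iteration_2d(beta, alpha=0.9, tau=1.0, q_max=5, e_max=6, sweeps=80,
                       H=((0.5, 0.5), (1.0, 0.5)),
                       A=((0, 0.5), (2, 0.5)),
                       E=((0, 0.5), (3, 0.5))):
    """Value iteration over states (q, channel index, e_b) and actions (u, eta)."""
    states = list(product(range(q_max + 1), range(len(H)), range(e_max + 1)))
    V = {s: 0.0 for s in states}
    policy = {}
    for _ in range(sweeps):
        # cont[u][eta]: expected next value given backlog u and leftover energy eta.
        cont = [[sum(pa * pe * ph * V[(min(u + a, q_max), j, min(eta + e, e_max))]
                     for (a, pa), (e, pe), (j, (_, ph)) in product(A, E, enumerate(H)))
                 for eta in range(e_max + 1)] for u in range(q_max + 1)]
        V_new = {}
        for q, i, e_b in states:
            h = H[i][0]
            best = None
            for u, eta in product(range(q + 1), range(e_b + 1)):
                # Grid power is the positive part of required power minus battery power.
                grid = max(power(h, q - u) - (e_b - eta) / tau, 0.0)
                cost = q + beta * grid + alpha * cont[u][eta]
                if best is None or cost < best[0]:
                    best = (cost, u, eta)
            V_new[(q, i, e_b)] = best[0]
            policy[(q, i, e_b)] = best[1:]
        V = V_new
    return V, policy

V, policy = value_iteration_2d(beta=1.0)
```

Here `policy[(q, i, e_b)]` holds the minimizing pair $(u, \eta)$; the corresponding rate and battery power are $r = q - u$ and $w = (e_b - \eta)/\tau$. The computed values can be inspected for monotonicity in $q$ and $e_b$ on such a toy instance.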
and the corresponding value iteration algorithm (or successive approximation method) is

$$V_{\alpha,n}(q, h, a, e_b, e) = \min_{u \in \{0, 1, \cdots, q\},\, \eta \in \{0, 1, \cdots, e_b\}} \left\{ q + \beta\left[\rho \frac{\sigma^2}{h}\left(e^{\theta(q - u)} - 1\right) + \Delta(q - u) - \frac{e_b - \eta}{\tau}\right]^{+} + \alpha\, \mathbb{E}_{h,a,e}\left[V_{\alpha,n-1}(u + A, H, A, (\eta + E)^{-}, E)\right] \right\} \tag{22}$$

with $V_{\alpha,0}(q, h, a, e_b, e) = 0$.

The following properties characterize the value function $V_{\alpha}(q, h, a, e_b, e)$.

Property 2. $V_{\alpha}(q, h, a, e_b, e)$ is an increasing function of $q$.

Proof: See Appendix C. □

Property 3. $V_{\alpha}(q, h, a, e_b, e)$ is a non-increasing function of $e_b$.

Proof: See Appendix D. □

In the practical case, the allocated harvested power will not surpass the required total power. Thus, we assume that $(u, \eta)$ always guarantees that

$$\rho \frac{\sigma^2}{h}\left(e^{\theta(q - u)} - 1\right) + \Delta(q - u) \geq \frac{e_b - \eta}{\tau}. \tag{23}$$

Based on this assumption, $P_{grid}(x, r, w) = P(x, r) - w$. The following property gives the joint convexity of $V_{\alpha}(q, h, a, e_b, e)$ in $(q, e_b)$.

Property 4. $V_{\alpha}(q, h, a, e_b, e)$ is convex in $(q, e_b)$.

Proof: See Appendix E. □

The discount optimal solution will be discussed in the following.

Lemma 8. In state $x = (q, h, a, e_b, e)$, $(u(x), \eta(x))$ is not the discount optimal solution if $u(x) \neq 0$ and $\eta(x) + e > E_{\max}$.

Remark: Lemma 8 gives a sufficient condition for the non-optimality. Meanwhile, Lemma 8 can also be viewed as a necessary condition for the optimality.

Lemma 9. Denote the discount optimal policy in state $x = (q, h, a, e_b, e)$ as $(u^*(x), \eta^*(x))$. Then, $(u^*(x), \eta^*(x))$ satisfies the following inequality array

$$Z_1(u^*, h, a, \eta^*, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right) < Z_1(u^* + 1, h, a, \eta^*, e), \tag{24}$$

$$Z_2(u^*, h, a, \eta^*, e) < -\frac{\beta}{\tau} < Z_2(u^*, h, a, \eta^* + 1, e), \tag{25}$$

$$Z_3(u^*, h, a, \eta^*, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right) < Z_3(u^* + 1, h, a, \eta^* + 1, e), \tag{26}$$

where

$$Z_1(u, h, a, \eta, e) = e^{\theta u}\left\{\alpha\, \mathbb{E}_{h,a,e}\left[G_1(u + A, H, A, (\eta + E)^{-}, E)\right] + \beta\left[\Delta(q - u) - \Delta(q - u + 1)\right]\right\} \tag{27}$$

with

$$G_1(q, h, a, e_b, e) = V_{\alpha}(q, h, a, e_b, e) - V_{\alpha}(q - 1, h, a, e_b, e), \tag{28}$$

$$Z_2(u, h, a, \eta, e) = \alpha\, \mathbb{E}_{h,a,e}\left[G_2(u + A, H, A, (\eta + E)^{-}, E)\right] \tag{29}$$

with

$$G_2(q, h, a, e_b, e) = V_{\alpha}(q, h, a, e_b, e) - V_{\alpha}(q, h, a, e_b - 1, e), \tag{30}$$

and

$$Z_3(u, h, a, \eta, e) = e^{\theta u}\left\{\alpha\, \mathbb{E}_{h,a,e}\left[G_{12}(u + A, H, A, (\eta + E)^{-}, E)\right] + \beta\left[\Delta(q - u) - \Delta(q - u + 1)\right] + \frac{\beta}{\tau}\right\} \tag{31}$$

with

$$G_{12}(q, h, a, e_b, e) = V_{\alpha}(q, h, a, e_b, e) - V_{\alpha}(q - 1, h, a, e_b - 1, e). \tag{32}$$

Proof: See Appendix F. □

Remark: Lemma 9 reveals a necessary condition for the optimality. When $(u^*, \eta^*)$ is on the boundary of the feasible set, corresponding conditions can also be obtained by following the proof of Lemma 9.

Lemma 10. For $x = (q, h, a, e_b, e)$ satisfying

$$Z_1\left(0, h, a, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}, e\right) > \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right)$$

and

$$Z_2\left(0, h, a, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}, e\right) > -\frac{\beta}{\tau},$$

$\left(0, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}\right)$ is the discount optimal policy. In addition, for $(q, h, a, e_b, e)$ satisfying

$$Z_1(q, h, a, e_b, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right)$$

and

$$Z_2(q, h, a, e_b, e) < -\frac{\beta}{\tau},$$

$(q, e_b)$ is the discount optimal policy.

Proof: See Appendix G. □

Remark: $\left(0, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}\right)$ means serving all the data in the buffer: if the required power is less than the power that can be supplied by the battery, all the required power is allocated from the battery. Otherwise, all the battery's energy is allocated (the rest of the required power is allocated from the power grid). $(q, e_b)$ means serving nothing and allocating nothing from the battery.
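The interpretation in this remark can be checked by translating the pair $(u, \eta)$ back to $(r, w)$ with the definitions $u = q - r$ and $\eta = e_b - w\tau$ from the start of this subsection. The short derivation below only restates those definitions and introduces no new assumptions.

```latex
\begin{align*}
u = 0 &\;\Rightarrow\; r = q - u = q, \\
\eta = \tau \max\Big\{0, \tfrac{e_b}{\tau} - P(x, q)\Big\}
  &\;\Rightarrow\; w = \frac{e_b - \eta}{\tau}
   = \frac{e_b}{\tau} - \max\Big\{0, \tfrac{e_b}{\tau} - P(x, q)\Big\}
   = \min\Big\{\tfrac{e_b}{\tau},\, P(x, q)\Big\}, \\
(u, \eta) = (q, e_b) &\;\Rightarrow\; r = 0, \quad w = 0.
\end{align*}
```

The first policy therefore serves the whole buffer and draws from the battery the smaller of the required power and the power the battery can supply, while the second policy is idle.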

4.2 Optimal policy for the average cost MDP 

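First, Lemma 8 still holds for the average cost MDP. Next, based on Lemma 7 and Lemma 9, we have the following lemma.

Lemma 11. Given state $x = (q, h, a, e_b, e)$, the average cost optimal policy $(u^*(x), \eta^*(x))$ should satisfy the following inequality array

$$\bar{Z}_1(u^*, h, a, \eta^*, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right) < \bar{Z}_1(u^* + 1, h, a, \eta^*, e), \tag{33}$$

$$\bar{Z}_2(u^*, h, a, \eta^*, e) < -\frac{\beta}{\tau} < \bar{Z}_2(u^*, h, a, \eta^* + 1, e), \tag{34}$$

$$\bar{Z}_3(u^*, h, a, \eta^*, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right) < \bar{Z}_3(u^* + 1, h, a, \eta^* + 1, e), \tag{35}$$

where $\bar{Z}_1(u, h, a, \eta, e) = \lim_{\alpha \to 1} Z_1(u, h, a, \eta, e)$, $\bar{Z}_2(u, h, a, \eta, e) = \lim_{\alpha \to 1} Z_2(u, h, a, \eta, e)$, and $\bar{Z}_3(u, h, a, \eta, e) = \lim_{\alpha \to 1} Z_3(u, h, a, \eta, e)$.

Combining Lemma 7 and Lemma 10, we derive the following lemma.

Lemma 12. For $x = (q, h, a, e_b, e)$ satisfying

$$\bar{Z}_1\left(0, h, a, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}, e\right) > \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right)$$

and

$$\bar{Z}_2\left(0, h, a, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}, e\right) > -\frac{\beta}{\tau},$$

$\left(0, \tau \max\left\{0, \frac{e_b}{\tau} - P(x, q)\right\}\right)$ is the optimal policy. In addition, for $(q, h, a, e_b, e)$ satisfying

$$\bar{Z}_1(q, h, a, e_b, e) < \beta \rho \frac{\sigma^2}{h}\, e^{\theta q}\left(e^{\theta} - 1\right)$$

and

$$\bar{Z}_2(q, h, a, e_b, e) < -\frac{\beta}{\tau},$$

$(q, e_b)$ is the optimal policy.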

4.3 Optimal policy for general battery capacity 

The following lemma gives the relation between UP$_{\beta}$ 
and the original problem. It reveals that the original 
problem has the same solution as UP$_{\beta}$ with a certain $\beta$. 

Lemma 13. There exists a $\beta > 0$ for which the optimal 
policy obtained for UP$_{\beta}$ is also optimal for the original 
constrained MDP. 



4.4 Reducing the policy's dimension 

The rate allocation $r$ and the power allocation from the 
battery $w$ are coupled and affect each other. 
However, if we assume that the rate $r$ has been chosen, then 
the required total power is fixed. In this case, 
we allocate as much power as possible from the 
battery to meet the required total power, i.e., we use the greedy 
policy for the battery power allocation. This is because 
the power from the battery is free.$^6$ We can guess that the 
greedy allocation strategy of battery power is optimal. 
However, this is difficult to prove. The difficulty lies in 
the fact that the remaining battery energy affects the 
future actions and costs (e.g., (45)). On the other hand, once 
$w$ has been fixed, the power allocation from the power 
grid can also affect $r$. In summary, when $r$ is chosen, 
the optimal $w^*$ is the greedy policy. By contrast, if $w$ 
is fixed, the optimal $r$ is not determined, and we need to solve the 
power allocation from the power grid to find the optimal 
$r^*$. Thus, we can reduce the policy from $(r, w)$ to $r$. We 
have the following conjecture. 

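Conjecture 1. Let $\pi_r = (r[0], r[1], \cdots)$. The original optimization problem can be converted to the following problem.

$$\min_{\pi_r} B_x^{\pi_r} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_x^{\pi_r}\left[\sum_{k=0}^{n-1} Q[k]\right] \tag{36}$$

$$\text{s.t.} \quad K_x^{\pi_r} := \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_x^{\pi_r}\left[\sum_{k=0}^{n-1} P_{grid}[k]\right] \leq \bar{P}, \tag{37a}$$

$$R[k] \leq Q[k], \tag{37b}$$

where

$$P_{grid}[k] = P(X[k], R[k]) - \min\left\{P(X[k], R[k]),\; \frac{E_b[k]}{\tau}\right\} \tag{38}$$

and the evolution of energy in the battery becomes

$$E_b[k + 1] = \left(E_b[k] - \tau \min\left\{P(X[k], R[k]),\; \frac{E_b[k]}{\tau}\right\} + E[k]\right)^{-}. \tag{39}$$

Proof. See Appendix H. □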



Remark: The policy can be reduced in dimension ((r, w) —> 
r). If the stated (3 in Lemma [13] satisfying ft ^> 1, Conjecture 
1 can be proved based on (45) together with Lemma 7 and 
Lemma 13. From Conjecture 1, we can also arrive at Theorem 
1 for the infinite battery capacity case. 

In the whole paper, we assume that the power from 
grid and harvester is sufficient to stabilize the queue 
length. The stability issue such as the bounds on average 
harvest rate or average packet arrival rate will be studied 
in future work. 

6. Please refer to [19]. 

Fig. 2. The mean buffer length performance for different 
initial queue lengths and battery energies. The pairs of 
initial queue length and battery energy for the infinite-capacity 
case are (9,46), (5,28), (4,5), (1,20), (2,12), and those for the 
finite-capacity case are (1,23), (9,33), (5,11), (9,46), (2,33). 



5 Numerical results 

In this section, simulation results are presented. In all 
simulations, we set $\tau = 1$, $b = 1$, $N = 3$, $C = 1$, $\rho = 1$, 
and $\sigma^2 = 1$. 

Fig. 2 shows the mean buffer length performance with 
respect to the initial queue length and initial battery energy. 
In the simulation, only the harvested energy is available 
and no grid power can be utilized (i.e., $\bar{P} = 0$). We 
use the greedy strategy, i.e., $r(x) = \min\{q, P^{-1}(e_b)\}$$^7$ in 
state $x = (q, h, a, e_b, e)$. We consider the i.i.d. case of 
$H$, $A$, and $E$. $H$ takes values $\{0.4, 0.6, 1\}$ with 
probabilities $\{0.3, 0.5, 0.2\}$, respectively. $A$ takes values 0 and 10 
with equal probability 0.5. $E$ takes values $\{0, 50, 100\}$ 
with probabilities $\{0.3, 0.5, 0.2\}$, respectively. The initial 
queue length and initial battery energy are randomly 
generated. For each pair of initial queue length and 
initial battery energy, $10^6$ time-slots are averaged. For 
the finite capacity case, the capacity is set to 200. 
From the figure, we can see that the mean buffer length 
performance is almost independent of the initial queue length 
and initial battery energy. Consequently, in the following 
simulations, we fix the initial queue length and initial 
battery energy for each simulation. We can also 
observe that the infinite battery capacity yields a better 
mean buffer length performance than the finite capacity. 
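The greedy strategy used in this experiment can be sketched as a short Monte Carlo simulation. The code below is a hedged illustration rather than the authors' simulator: the function names are hypothetical, $P^{-1}(e_b)$ is implemented as the largest integer rate whose required power the battery can cover, and the distributions and constants are the ones listed above, with no grid power.

```python
import math
import random

THETA = 2.0 * math.log(2.0) * 1 / 3  # theta = 2 ln(2) b / N with b = 1, N = 3

def power(h, r, rho=1.0, sigma2=1.0, C=1.0):
    """Required power from Eq. (1); zero when nothing is sent."""
    return 0.0 if r == 0 else rho * sigma2 / h * (math.exp(THETA * r) - 1.0) + C

def greedy_rate(q, h, e_b, tau=1.0):
    """Largest r <= q whose required power the battery alone can cover."""
    r = 0
    while r < q and power(h, r + 1) <= e_b / tau:
        r += 1
    return r

def mean_buffer_length(slots, e_max, q0=0, e0=0.0, seed=0, tau=1.0):
    """Average queue length under the greedy strategy with no grid power."""
    rng = random.Random(seed)
    q, e_b, total = q0, e0, 0
    for _ in range(slots):
        total += q
        h = rng.choices([0.4, 0.6, 1.0], weights=[0.3, 0.5, 0.2])[0]
        a = rng.choice([0, 10])
        e = rng.choices([0, 50, 100], weights=[0.3, 0.5, 0.2])[0]
        r = greedy_rate(q, h, e_b, tau)
        w = power(h, r)                      # all required power comes from the battery
        q = q - r + a                        # Eq. (2)
        e_b = min(e_b - w * tau + e, e_max)  # Eq. (3)
    return total / slots
```

Calling `mean_buffer_length(10**5, 200.0)` with different `q0` and `e0` gives a quick way to probe the sensitivity to the initial state; passing `float("inf")` as `e_max` mimics the infinite-capacity case.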

Fig. 3 illustrates the mean buffer length performance 
for different battery capacities when the grid power 
is unavailable. The simulation settings are the same 
as those in Fig. 2. The buffer length is averaged over 
$10^5$ time-slots. It can be observed that the mean buffer 
length decreases when the battery capacity increases. 
This is because when the battery capacity is small, the 
battery overflows more easily, and more harvested energy 
is wasted. We can also find that when the battery 
capacity increases beyond some level, the mean buffer 
length decreases very slowly. Since the i.i.d. processes 
of $H$, $A$, and $E$ as well as the strategy are predefined, 
once the battery capacity is larger than some value, the 
battery overflows with a very low probability. 

7. $P^{-1}(\cdot)$ is the inverse function of $P(x, r)$ with respect to $r$. 

Fig. 3. The mean buffer length performance with different 
battery capacities. 

Fig. 4 depicts the battery's overflow probabilities for different $\bar{A}$ when the grid power is unavailable. The simulation configuration is the same as that in Fig. 2, except that A takes values 0 and $2\bar{A}$ with equal probability 0.5. The overflow probability decreases as the data arrival rate increases. Since $r(x) = \min\{q, P^{-1}(e_b)\}$, increasing $\bar{A}$ increases the probability that all the battery power is allocated in a slot, which suppresses the overflow. When $\bar{A}$ is large (such as $\bar{A} = 6, \ldots, 12$ in the figure), the overflow probability becomes 0, i.e., no overflow occurs.

Fig. 4. Overflow probabilities for different values of the mean data arrival, $\bar{A}$.

Fig. 5 presents the mean buffer length performance with respect to the average channel gain, $\bar{H}$. In the simulation, H takes values $\{0.2, (\bar{H} - 0.3 \times 0.2 - 0.1 \times 1)/0.6, 1\}$ with probabilities {0.3, 0.6, 0.1}, respectively (so that the average value is $\bar{H}$). Other simulation settings are the same as those in Fig. 2. For each $\bar{H}$, $10^6$ time slots are averaged. Observe that the buffer length performance improves as $\bar{H}$ increases. Moreover, the buffer length decreases sharply when $\bar{H}$ is small, and when $\bar{H}$ is above a certain value (e.g., 0.4 in the figure), the decrease becomes moderate. This can be explained as follows. First, $P^{-1}(\cdot)$ is increasing in h. When $\bar{H}$ is small, $q > P^{-1}(e_b)$ with high probability, i.e., $r(x) = P^{-1}(e_b)$. The current buffer length $q - r(x) = q - P^{-1}(e_b)$ decreases as h increases. Thus, the mean buffer length decreases. Once $\bar{H}$ is larger than a certain value, $q < P^{-1}(e_b)$ with high probability. Then $r(x) = q$ and the current buffer length becomes zero with high probability. In this case, increasing $\bar{H}$ has little effect on the mean buffer length.

Fig. 5. The mean buffer length performance for different values of $\bar{H}$.
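The three-point channel distribution used for Fig. 5 can be checked with a few lines of Python. The function names below are illustrative only; the support {0.2, middle, 1} and the probabilities {0.3, 0.6, 0.1} are taken from the description above, and the middle value is chosen so that the mean equals the target average gain.

```python
def channel_distribution(h_bar):
    """Return (values, probabilities) of H for a target average gain h_bar."""
    probs = [0.3, 0.6, 0.1]
    # Middle support point that makes the mean equal to h_bar.
    middle = (h_bar - 0.3 * 0.2 - 0.1 * 1.0) / 0.6
    return [0.2, middle, 1.0], probs


def mean(values, probs):
    """Expectation of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))
```

Note that the middle value is a valid channel gain only for a limited range of targets; for instance, it is negative whenever the target average gain is below 0.16.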

Fig. 6 plots the average grid power consumption with respect to the mean data arrival, $\bar{A}$. In the simulation, all the data are served in each time slot. If the required power is not greater than the battery power, then all the power is supplied by the battery and no grid power is used. Otherwise, all the battery power is allocated and the rest is supplied by the power grid. That is to say, for state $x = (q, h, a, e_b, e)$, the strategy is $(r = q, w = \min\{e_b, P(x, r)\})$. A takes values 0 and $2\bar{A}$ with equal probability 0.5. The mean grid power is averaged over $10^5$ time slots. We can observe that when $\bar{A}$ is small, the grid power consumption is zero. However, when $\bar{A}$ is large, the grid power consumption grows rapidly with $\bar{A}$, roughly exponentially. This can be explained as follows: when $\bar{A}$ is small, the required power is small and the battery can supply the power, so no grid power is consumed. Once $\bar{A}$ is large, the required power is much larger than the battery power, and the grid power becomes the main power source. Since the required power varies roughly exponentially with the transmission rate, the grid power consumption varies exponentially with $\bar{A}$. Meanwhile, we can see that a larger capacity gives better performance: the larger the capacity, the less grid power is consumed.

Fig. 6. Average grid power consumption for different values of $\bar{A}$.
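The serve-all strategy used for Fig. 6 amounts to a simple per-slot split of the required power between the battery and the grid. The following Python sketch illustrates it; the function name and the choice to pass the required power P(x, r) in as a precomputed number `power_needed` are assumptions made for illustration.

```python
def serve_all_action(q, e_b, power_needed):
    """Serve the whole buffer; draw from the battery first, then from the grid.

    q            -- current buffer length (all of it is served, r = q)
    e_b          -- power available from the battery in this slot
    power_needed -- total power P(x, r) required to serve r = q
    Returns (r, w, grid) with w = min{e_b, P(x, r)} and grid = P(x, r) - w.
    """
    r = q
    w = min(e_b, power_needed)
    grid = power_needed - w
    return r, w, grid
```

When the battery covers the requirement the grid term is zero, which corresponds to the flat part of the curve for small mean data arrival.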

Furthermore, from Fig. 6 it can be seen that if $\bar{A}$ is less than a certain value, the grid power will be less than a certain value. Since the strategy in Fig. 6 is optimal for the buffer delay minimization without the average grid power constraint, if $\bar{A}$ is small enough that the average grid power does not exceed the constraint, i.e., the average grid power constraint is satisfied, then the strategy is also optimal when the grid power constraint is considered. For example, when $\bar{P} = 300$, according to Fig. 6 the strategy is optimal for $\bar{A} = 1, 2, \ldots, 6$. When $\bar{P} = 100$, the strategy is optimal for $\bar{A} = 1, 2, \ldots, 5$. The reason is that when the average grid power plus the harvested power is large enough to serve all the data, serving all the data is optimal. When $\bar{P} = 300$, the average grid power plus the harvested power is large enough to serve all the buffered data for $\bar{A} = 1, 2, \ldots, 6$, so serving all the data is optimal.

6 Conclusion 

In this paper, we have studied the power allocation of the physical layer together with the optimal mean buffer delay of the upper layer in green networks with energy harvesting nodes. The physical-layer power allocation contains two aspects: power allocation from the power grid and power allocation from the battery. To model and analyze the conflicting relation between power and delay, we have formulated a constrained MDP with a two-dimensional policy. The infinite and finite battery capacity cases have been investigated separately. Structural properties have been derived, and optimal policies have been obtained under special circumstances. Specifically, lower and upper bounds of the optimal rate have been obtained for infinite battery capacity, and the closed-form optimal rate has been derived for the i.i.d. setting. For the finite capacity, we have derived two necessary conditions for the optimal policy. Moreover, we have obtained the optimal policy for a certain system state.
Appendix A

Proof of Theorem Q] 

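Since Q[k] is determined by the action R[n] (see footnote 8), which is related to P(X[k], R[k]) according to Q}, we first analyze the constraint on P(X[k], R[k]). Substituting (6) into $K_x^{\pi}$, we have

$$K_x^{\pi} = \limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1}\big(P(X[k],R[k]) - W[k]\big)\right] \le \bar{P},$$

or equivalently,

$$\limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} P(X[k],R[k])\right] \le \bar{P} + \limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} W[k]\right] \tag{40}$$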



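with $W[k] \le E_b[k]/\tau$. Then, the action W[k] should reach its upper bound to get the largest feasible set of the optimization problem, i.e., $W[k] = E_b[k]/\tau$. Thus,

$$\limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} P(X[k],R[k])\right] \le \bar{P} + \limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} \frac{E_b[k]}{\tau}\right]. \tag{41}$$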



8. R[n] is the only controllable factor in (2).

9. (40) is looser than the original constraint since $P_{\mathrm{grid}}[k] = P(X[k], R[k]) - W[k] \ge 0$ is not taken into consideration, but we will show that this does not affect the optimal solution.



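If the battery can be considered to have infinite capacity to store the harvested energy, the battery's energy evolution becomes

$$E_b[n+1] = E_b[n] - W[n]\tau + E[n].$$

As $E_b[n] - W[n]\tau = 0$, we have $E_b[n+1] = E[n]$. Therefore,

$$\lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} E_b[k]\right] = \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} E[k]\right] \tag{42}$$

$$= \mathbb{E}\{E[n]\}. \tag{43}$$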
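The step from the battery evolution to the time-average identity can be illustrated numerically. The Python sketch below assumes an infinite-capacity battery that is fully drained in every slot (W[n] = E_b[n]/τ), as in the argument above; the function names and the sample harvest sequence are hypothetical.

```python
def battery_trace(harvest, eb0, tau=1.0):
    """Battery energy over time when all stored energy is allocated each slot."""
    eb = [float(eb0)]
    for e in harvest:
        w = eb[-1] / tau                 # allocate everything that is stored
        eb.append(eb[-1] - w * tau + e)  # infinite capacity: no clipping
    return eb


def time_average(xs):
    """Plain arithmetic mean of a finite sequence."""
    return sum(xs) / len(xs)
```

With this allocation, every battery level after the initial one equals the previous slot's harvested energy, so the time average of the battery energy coincides with that of the harvested energy up to the contribution of the initial level, which vanishes as the horizon grows.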



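Consequently, the constraint becomes

$$\limsup_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} P(X[k],R[k])\right] \le \bar{P} + \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{n-1} \frac{E[k]}{\tau}\right]. \tag{44}$$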



We can reduce the state to be x and X [k] in the left part 
of Jilt . Meanwhile, the initial state can be reduced to be 
x and the superscript 7r can be removed in the right 
part of 104). Based on the above analysis, we claim the 
optimal r* is the solution of 10. When the optimal rate 
allocation r* is obtained, the optimal w* can be given by 

©■ 

In solving r*, we consider the average power constraint on the total transmission power (8a), and we do not consider the per-step power constraint on the total transmission power, $P(X[k], R[k]) - W[k] \ge 0$. However, $P(X[k], R[k]) - W[k] \ge 0$ can always hold when we allocate the harvested power according to (9) (see footnote 10), i.e., we put the step constraint $P(X[k], R[k]) - W[k] \ge 0$ in the allocation of the harvested power. That is to say, even when considering $P(X[k], R[k]) - W[k] \ge 0$, (r*, w*) with r* given by (0 and w* given by 10 is also the optimal policy. When we loosen the constraint on P(X[k], R[k]) to be in the average form as L^, , we obtain the optimal rate r*. We claim that r* is also feasible under the original constraint if we set a proper power allocation policy for the battery (e.g., ©). That is because the grid power can always fill the gap between the battery power and the power required for r* in each step, and there is no waste of the battery's energy, i.e., no overflow (see footnote 11). Therefore, r* is optimal for the original problem. This completes the proof of the theorem.



10. (9) also guarantees (5c).

11. The grid power has no constraint in one step.

Appendix B
Proof of Lemma 7

We prove the lemma by applying Theorem 3.8 in [23]. First, we can prove that the conditions of Proposition 2.1 in E31 hold. Next, the discounted cost optimality equation (21) for $V_\alpha(x)$ is

$$V_\alpha(q,h,a,e_b,e) = \min_{r\in\{0,1,\ldots,q\},\; w\in\{0,1,\ldots,\frac{e_b}{\tau}\}} \Big\{ q + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta r}-1\big) + \Lambda(r) - w\Big]^{+} + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(q-r+A,H,A,(e_b - w\tau + E)^{-},E\big)\big]\Big\}. \tag{45}$$

We can see that $V_\alpha(q,h,a,e_b,e)$ is increasing in q and non-increasing in $e_b$ given (h, a, e), since the larger the initial buffer, the larger the cost to go, and the larger the initial battery energy, the smaller the cost (see footnote 12). Thus, $\arg\inf_{y\in X} V_\alpha(y) = (0,h_0,a_0,E_{\max},e_0) := x_0$, i.e., the infimum is attained when the system begins with an empty buffer, a full battery, and some channel state $h_0$, arrival state $a_0$, and harvested energy arrival state $e_0$. When the buffer is empty, the set of feasible rates is {0}. Then, since $f(x_0, 0, w) = 0$, we get

$$V_\alpha(x_0) = \min_{w\in\{0,1,\ldots,\frac{E_{\max}}{\tau}\}} \alpha\,\mathbb{E}_{h_0,a_0,e_0}\big[V_\alpha\big(A,H,A,(E_{\max}-w\tau+E)^{-},E\big)\big] = \alpha\,\mathbb{E}_{h_0,a_0,e_0}\big[V_\alpha(A,H,A,E_{\max},E)\big]. \tag{46}$$

Meanwhile, since policy (q, 0) is feasible for state $(q, h, a, e_b, e)$, then

$$V_\alpha(x) \le q + \rho\frac{\sigma^2}{h}\big(e^{\theta q}-1\big) + C + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(A,H,A,(e_b+E)^{-},E\big)\big]. \tag{47}$$

Let the system start in state $(a, h, a, e_b + e, e)$, and take the action $r[n] = a[n]$ and $w[n] \le e[n]$ for all n. Let $\xi(h,a,e_b,e)$ be the expected number of slots to hit the state $(a_0, h_0, a_0, E_{\max}, e_0)$ (see footnote 13). Observe that $\xi(h,a,e_b,e)$ is finite. Let

$$c_{\max} = \max_{h,a}\Big\{a + \rho\frac{\sigma^2}{h}\big(e^{\theta a}-1\big)\Big\} + C.$$

Applying Wald's lemma [25], we get

$$\alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(A,H,A,(e_b+E)^{-},E\big)\big] \le c_{\max}\,\xi(h,a,e_b,e) + \alpha\,\mathbb{E}_{h_0,a_0,e_0}\big[V_\alpha(A,H,A,E_{\max},E)\big] = c_{\max}\,\xi(h,a,e_b,e) + V_\alpha(x_0). \tag{48}$$

In (48), we have used $(E_{\max} + E)^{-} = E_{\max}$. Next, combining (47) and (48), we have

$$V_\alpha(x) \le q + \rho\frac{\sigma^2}{h}\big(e^{\theta q}-1\big) + C + c_{\max}\,\xi(h,a,e_b,e) + V_\alpha(x_0). \tag{49}$$

Thus,

$$V_\alpha(x) - V_\alpha(x_0) \le q + \rho\frac{\sigma^2}{h}\big(e^{\theta q}-1\big) + C + c_{\max}\,\xi(h,a,e_b,e) < \infty. \tag{50}$$

Third, there exists a policy $\pi \in A$ and an initial state $x \in X$ such that $J_\beta < \infty$ in the practical problem. Otherwise, the cost is infinite for all policies and any policy is optimal.

Based on the above analysis, the conditions of Theorem 3.8 in [23] hold, and the lemma is proved.

12. See the formal proofs in Property 2 and Property 3.

13. When $w[n] \le e[n]$, $E_{\max}$ is the absorbing state of the battery energy.

Appendix C

Proof of Property 2

We verify the increasing property by induction. According to (22), $V_{\alpha,0} = 0$ and $V_{\alpha,1} = q$, so the increasing property in q holds. Assume $V_{\alpha,n-1}(q,h,a,e_b,e)$ is increasing in q. Fix $(h, a, e_b, e)$. In state $(q+1,h,a,e_b,e)$, the set of feasible u is $\{0,1,\ldots,q+1\}$, whereas it is $\{0,1,\ldots,q\}$ for state $(q,h,a,e_b,e)$. Consider state $(q+1,h,a,e_b,e)$ and let the optimal action be $(u^*,\eta^*)$. If $u^* \in \{0,1,\ldots,q\}$, then

$$V_{\alpha,n}(q+1,h,a,e_b,e) = q + 1 + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q+1-u^*)}-1\big) + \Lambda(q+1-u^*) - \frac{e_b-\eta^*}{\tau}\Big]^{+} + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(u^*+A,H,A,(\eta^*+E)^{-},E\big)\big]. \tag{51}$$

As $(u^*,\eta^*)$ is feasible in state $(q,h,a,e_b,e)$,

$$V_{\alpha,n}(q,h,a,e_b,e) \le q + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u^*)}-1\big) + \Lambda(q-u^*) - \frac{e_b-\eta^*}{\tau}\Big]^{+} + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(u^*+A,H,A,(\eta^*+E)^{-},E\big)\big] < V_{\alpha,n}(q+1,h,a,e_b,e). \tag{52}$$

If $u^* = q+1$,

$$V_{\alpha,n}(q+1,h,a,e_b,e) = q + 1 + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(q+1+A,H,A,(\eta^*+E)^{-},E\big)\big]. \tag{53}$$

Meanwhile, since $(q,\eta^*)$ is feasible in state $(q,h,a,e_b,e)$,

$$V_{\alpha,n}(q,h,a,e_b,e) \le q + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(q+A,H,A,(\eta^*+E)^{-},E\big)\big] \overset{(a)}{<} V_{\alpha,n}(q+1,h,a,e_b,e), \tag{54}$$

where (a) holds by the induction hypothesis.

Appendix D

Proof of Property 3

We verify this by induction. According to (22), $V_{\alpha,0} = 0$, and then $V_{\alpha,1} = q$, so the non-increasing property holds. Assume $V_{\alpha,n-1}(q,h,a,e_b,e)$ is non-increasing in $e_b$. Given $(q,h,a,e)$, consider state $(q,h,a,e_b,e)$ and let $(u^*,\eta^*)$ be the optimal policy, i.e.,

$$V_{\alpha,n}(q,h,a,e_b,e) = q + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u^*)}-1\big) + \Lambda(q-u^*) - \frac{e_b-\eta^*}{\tau}\Big]^{+} + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(u^*+A,H,A,(\eta^*+E)^{-},E\big)\big]. \tag{55}$$

For state $(q,h,a,e_b+1,e)$, $(u^*,\eta^*)$ is feasible; then we have

$$V_{\alpha,n}(q,h,a,e_b+1,e) \le q + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u^*)}-1\big) + \Lambda(q-u^*) - \frac{e_b+1-\eta^*}{\tau}\Big]^{+} + \alpha\,\mathbb{E}_{h,a,e}\big[V_{\alpha,n-1}\big(u^*+A,H,A,(\eta^*+E)^{-},E\big)\big] \le V_{\alpha,n}(q,h,a,e_b,e). \tag{56}$$



Appendix E 

Proof of Property 3] 

First, we prove the following proposition.

Proposition 1. For $\phi \in (0,1)$ and $\forall x_1, x_2, y$,
$$\phi\min\{x_1,y\} + (1-\phi)\min\{x_2,y\} \le \min\{\phi x_1 + (1-\phi)x_2,\, y\}.$$

Proof: The proposition can be proved by considering $\min\{x_1,x_2\} \ge y$, $\max\{x_1,x_2\} \le y$, and $\min\{x_1,x_2\} < y < \max\{x_1,x_2\}$, respectively. □
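Proposition 1 can also be sanity-checked numerically. The Python sketch below evaluates the gap between the right-hand and left-hand sides on random inputs; the function names, sampling ranges, and tolerance are arbitrary choices for illustration, and such a check does not replace the case analysis in the proof.

```python
import random


def prop1_gap(phi, x1, x2, y):
    """Right-hand side minus left-hand side of Proposition 1 (should be >= 0)."""
    lhs = phi * min(x1, y) + (1 - phi) * min(x2, y)
    rhs = min(phi * x1 + (1 - phi) * x2, y)
    return rhs - lhs


def smallest_gap(trials=10000, seed=0):
    """Smallest gap observed over randomly drawn (phi, x1, x2, y)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        phi = rng.uniform(0.001, 0.999)
        x1, x2, y = (rng.uniform(-100.0, 100.0) for _ in range(3))
        worst = min(worst, prop1_gap(phi, x1, x2, y))
    return worst
```

The gap is strictly positive only in the mixed case where y lies between the two arguments, for example for phi = 0.5, x1 = 0, x2 = 10, y = 4.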
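The convexity is proved by induction. For n = 0, $V_{\alpha,0} = 0$, and it is convex. Assume $V_{\alpha,n-1}(q,h,a,e_b,e)$ is convex in $(q, e_b)$. Fix $(h, a, e)$, and let $(u_1,\eta_1)$ and $(u_2,\eta_2)$ be the optimal policies for $(q_1,e_{b1})$ and $(q_2,e_{b2})$, respectively. Then, we get

$$\begin{aligned}
&\phi V_{\alpha,n}(q_1,h,a,e_{b1},e) + (1-\phi)V_{\alpha,n}(q_2,h,a,e_{b2},e)\\
&= \phi\Big\{q_1 + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q_1-u_1)}-1\big) + \Lambda(q_1-u_1) - \frac{e_{b1}-\eta_1}{\tau}\Big]^{+}\Big\}\\
&\quad + (1-\phi)\Big\{q_2 + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q_2-u_2)}-1\big) + \Lambda(q_2-u_2) - \frac{e_{b2}-\eta_2}{\tau}\Big]^{+}\Big\}\\
&\quad + \alpha\,\mathbb{E}_{h,a,e}\Big[\phi V_{\alpha,n-1}\big(u_1+A,H,A,(\eta_1+E)^{-},E\big) + (1-\phi)V_{\alpha,n-1}\big(u_2+A,H,A,(\eta_2+E)^{-},E\big)\Big]\\
&\overset{(b)}{\ge} \phi q_1 + (1-\phi)q_2 + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta[\phi(q_1-u_1)+(1-\phi)(q_2-u_2)]}-1\big) + \Lambda\big(\phi(q_1-u_1)+(1-\phi)(q_2-u_2)\big)\\
&\qquad - \frac{1}{\tau}\big(\phi(e_{b1}-\eta_1)+(1-\phi)(e_{b2}-\eta_2)\big)\Big]^{+}\\
&\quad + \alpha\,\mathbb{E}_{h,a,e}\Big[V_{\alpha,n-1}\big(\phi u_1+(1-\phi)u_2+A,H,A,\phi(\eta_1+E)^{-}+(1-\phi)(\eta_2+E)^{-},E\big)\Big]\\
&\overset{(c)}{\ge} \phi q_1 + (1-\phi)q_2 + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta[\phi(q_1-u_1)+(1-\phi)(q_2-u_2)]}-1\big) + \Lambda\big(\phi(q_1-u_1)+(1-\phi)(q_2-u_2)\big)\\
&\qquad - \frac{1}{\tau}\big(\phi(e_{b1}-\eta_1)+(1-\phi)(e_{b2}-\eta_2)\big)\Big]^{+}\\
&\quad + \alpha\,\mathbb{E}_{h,a,e}\Big[V_{\alpha,n-1}\big(\phi u_1+(1-\phi)u_2+A,H,A,(\phi\eta_1+(1-\phi)\eta_2+E)^{-},E\big)\Big]\\
&\overset{(d)}{\ge} V_{\alpha,n}\big(\phi q_1+(1-\phi)q_2,h,a,\phi e_{b1}+(1-\phi)e_{b2},e\big),
\end{aligned} \tag{57}$$

where (b) holds because of the convexity of $e^{\theta(q-u)} + \Lambda(q-u)$ (with respect to u) and of $V_{\alpha,n-1}(q,h,a,e_b,e)$, (c) holds because of Proposition 1 as well as Property 3, and (d) holds since $(\phi u_1+(1-\phi)u_2,\ \phi\eta_1+(1-\phi)\eta_2)$ is feasible for $\phi(q_1,h,a,e_{b1},e) + (1-\phi)(q_2,h,a,e_{b2},e)$.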

Appendix F
Proof of Lemma 9

Let

$$S(u,\eta) = q + \beta\Big[\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u)}-1\big) + \Lambda(q-u) - \frac{e_b-\eta}{\tau}\Big] + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big]. \tag{58}$$

First, we have

$$S(u+1,\eta) - S(u,\eta) = \beta\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u-1)} - e^{\theta(q-u)}\big) + \beta\big[\Lambda(q-u-1) - \Lambda(q-u)\big] + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u+1+A,H,A,(\eta+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big] \tag{59}$$

and

$$S(u-1,\eta) - S(u,\eta) = \beta\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u+1)} - e^{\theta(q-u)}\big) + \beta\big[\Lambda(q-u+1) - \Lambda(q-u)\big] + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u-1+A,H,A,(\eta+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big]. \tag{60}$$

Then, applying $S(u^*+1,\eta^*) - S(u^*,\eta^*) \ge 0$ and $S(u^*-1,\eta^*) - S(u^*,\eta^*) \ge 0$, we obtain (24). Similarly, as

$$S(u,\eta+1) - S(u,\eta) = \frac{\beta}{\tau} + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u+A,H,A,(\eta+1+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big] \tag{61}$$

and

$$S(u,\eta-1) - S(u,\eta) = -\frac{\beta}{\tau} + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u+A,H,A,(\eta-1+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big], \tag{62}$$

we can reach (25) from $S(u^*,\eta^*+1) - S(u^*,\eta^*) \ge 0$ and $S(u^*,\eta^*-1) - S(u^*,\eta^*) \ge 0$. In addition,

$$S(u+1,\eta+1) - S(u,\eta) = \beta\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u-1)} - e^{\theta(q-u)}\big) + \frac{\beta}{\tau} + \beta\big[\Lambda(q-u-1) - \Lambda(q-u)\big] + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u+1+A,H,A,(\eta+1+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big] \tag{63}$$

and

$$S(u-1,\eta-1) - S(u,\eta) = \beta\rho\frac{\sigma^2}{h}\big(e^{\theta(q-u+1)} - e^{\theta(q-u)}\big) - \frac{\beta}{\tau} + \beta\big[\Lambda(q-u+1) - \Lambda(q-u)\big] + \alpha\,\mathbb{E}_{h,a,e}\big[V_\alpha\big(u-1+A,H,A,(\eta-1+E)^{-},E\big) - V_\alpha\big(u+A,H,A,(\eta+E)^{-},E\big)\big]. \tag{64}$$

Then, (26) can be obtained by applying $S(u^*-1,\eta^*-1) - S(u^*,\eta^*) \ge 0$ and $S(u^*+1,\eta^*+1) - S(u^*,\eta^*) \ge 0$.

Appendix G

Proof of Lemma 10

Using Property ID we can derive that Zi(u,h,a,r],e) < 
Z\{u + 1, ft, a, 77, e), Zi(it, ft, a, 77, e) < Z\(u, ft., a, 77 + 1, e), 
Z 2 {u,h,a,rj,e) < Z^{u, h, a, 77 + 1, e), Zi(u,h,a,r),e) < 
Z\(u + 1, ft, a, 77, e), and Z^{u, h, a, 77, e) < Zs(u + 
1, ft, a, 77 + l,e)o On the other hand, j23l should be 
satisfied. Thus, given (ft,, a, e), Zi(0, ft, a, r max{0, — 
P(x, ?)},e), Z 2 (0, ft, a, r max{0, — P(x, g)},e), and 
Z 3 (0, ft, a, r max{0, ^ — P(x, q)}, e) are the smallest re- 
spectively. Following the proof of Lemma [9J we can 
prove the first half of the lemma by contradiction. 
Specifically, suppose (0, r max{0, ^- — P(x,q)}) is not the 
optimal solution, then S(u* - 1,77*) - S(u*,r]*) > or 
S(u*,rj* - 1) - S(u*,ri*) > should hold. We have 

Zi(0. h, a, r max{0, q)}, e) 

T 

2 

< Zi(u*,h,a,r)*,e) < pp^e 8q (e e - 1) 



or 



^(0, ft, a, rmax{0, P(x, q)}, e) 

T 



< Z 2 (u* , ft, a, 77* , e) < 



-j8 



and the contradiction occurs. 

We can prove the second half of the lemma sim- 
ilarly by using contradiction. First, given (ft, a, e), 
Zi(q, ft, a, e&, e) and Z 2 (q, ft, a, e&, e) are the largest values 
of Zi and Z 2 , respectively. Assume (q, et) is not the 
optimal solution, then S(u* + 1,77*) - S(u*,rj*) > 

14. It is assumed that a% h>a>e [Gx{q + A,H,A,(r) + E)~,E)] - 
e- e aE hta ^[G-i{q - 1 + A,'-H",A,(rj + E)~,E)] > f3C and 
<*%, a ,e [Gia(g + A, if, A, (77 + £)] - e~ e cM Ka ^ [G 12 (q - 1 + 
A,H,A, (77 + £)-,£)] + | (1 - e" 9 ) > /3C7. This assumption can be 
definitely satisfied when C is small. 



or ^(m*, 77* + 1) - S(u*,r}*) > should be satisfied. 
Consequently, we get 

Zi(q,h,a,eb,e) > Z\(u* + 1, ft, a, 77*, e) 



or 



Z 2 (q,h,a,eb,e) > Z 2 (u 1 h,a,r]* + l,e) > . 

r 

The contradiction occurs then. 



Appendix H
Proof of Lemma 13

The proof is based on the results of [26]. We should prove that, for some $\beta$, the optimal stationary policy $\pi^*$ of $\mathrm{UP}_\beta$ satisfies 1) $\pi^*$ yields $B^{\pi^*}$ and $K^{\pi^*}$ as limits for all $x \in X$; 2) $K^{\pi^*} = \bar{P}$. Observe that limsup and liminf are equal for each $\beta > 0$ (since the controlled chain is ergodic and the policy is stationary El ).